Figure Text Extraction in Biomedical Literature
نویسندگان
چکیده
BACKGROUND Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures. METHODOLOGY We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons. RESULTS/CONCLUSIONS The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36.6% precision, 19.3% recall, and 25.3% F1-score for text extraction. In addition, our results show that FigTExT can extract texts that do not appear in figure captions or other associated text, further suggesting the potential utility of FigTExT for improving figure search.
منابع مشابه
DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures
Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes De...
متن کاملExtraction of Drug-Drug Interaction from Literature through Detecting Linguistic-based Negation and Clause Dependency
Extracting biomedical relations such as drug-drug interaction (DDI) from text is an important task in biomedical NLP. Due to the large number of complex sentences in biomedical literature, researchers have employed some sentence simplification techniques to improve the performance of the relation extraction methods. However, due to difficulty of the task, there is no noteworthy improvement in t...
متن کاملClassification of Figures in Biomedical Literature toward a Figure Finding System
As biomedical full-text papers are becoming more available in digitized form on-line, there is a need for tools to mine information from all parts in the papers. Notably, since figures and their legends/captions in biomedical papers provide important information about research outcomes, mining techniques targeting them have attracted a great deal of attention. However, even a simple-sounding ta...
متن کاملBiomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey
In this survey paper, we discuss biomedical ontologies and major text mining techniques applied to biomedicine and healthcare. Biomedical ontologies such as UMLS are currently being adopted in text mining approaches because they provide domain knowledge for text mining approaches. In addition, biomedical ontologies enable us to resolve many linguistic problems when text mining approaches handle...
متن کاملAutomatic identification of ROI in figure images toward improving hybrid (text and image) biomedical document retrieval
Biomedical images are often referenced for clinical decision support (CDS), educational purposes, and research. They appear in specialized databases or in biomedical publications and are not meaningfully retrievable using primarily textbased retrieval systems. The task of automatically finding the images in an article that are most useful for the purpose of determining relevance to a clinical s...
متن کامل